Preparing data

Type conversion

Types of variables in R

As in other programming languages, R is capable of storing data in many different formats, most of which you’ve probably seen by now.

Loosely speaking, the class() function tells you what type of object you’re working with. (There are subtle differences between the class, type, and mode of an object, but these distinctions are beyond the scope of this course.)

# Make this evaluate to "character"
class("TRUE")
## [1] "character"
# Make this evaluate to "numeric"
class(8484.00)
## [1] "numeric"
# Make this evaluate to "integer"
class(99L)
## [1] "integer"
# Make this evaluate to "factor"
class(as.factor("factor"))
## [1] "factor"
# Make this evaluate to "logical"
class(FALSE)
## [1] "logical"

Common type conversions

It is often necessary to change, or coerce, the way that variables in a dataset are stored. This could be because of the way they were read into R (with read.csv(), for example), or because the function you are using to analyze the data requires variables to be coded a certain way.

Only certain coercions are allowed, but the rules for what works are generally pretty intuitive. For example, trying to convert a character string that does not look like a number fails: as.numeric("some text") returns NA with a warning.

There are a few less intuitive results. For example, under the hood, the logical values TRUE and FALSE are coded as 1 and 0, respectively. Therefore, as.logical(1) returns TRUE and as.numeric(TRUE) returns 1.
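A few coercions tried at the console make these rules concrete. The following is a minimal sketch using only base R; the variable names and the small medu factor are made up for illustration.

```r
# as.numeric() on non-numeric text returns NA and emits a warning
bad_number <- suppressWarnings(as.numeric("some text"))
is.na(bad_number)

# Strings that look like numbers convert cleanly
good_number <- as.numeric("8484.00")

# Logicals and numbers map onto each other: TRUE <-> 1, FALSE <-> 0
as.logical(c(1, 0))
as.numeric(c(TRUE, FALSE))

# Pitfall: as.numeric() on a factor returns the internal level codes,
# so convert to character first to recover the original values
medu <- factor(c("4", "1", "3"))
level_codes <- as.numeric(medu)                    # 3 1 2
original_values <- as.numeric(as.character(medu))  # 4 1 3
```

The factor case matters here because columns such as Medu and Fedu are coerced to factors in the exercise that follows.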

# Read students data
library(readr)
students <- read_csv("../xDatasets/students_with_dates.csv")
## Warning: Missing column names filled in: 'X1' [1]
# Preview students with str()
str(students, give.attr = FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 395 obs. of  33 variables:
##  $ X1         : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ school     : chr  "GP" "GP" "GP" "GP" ...
##  $ sex        : chr  "F" "F" "F" "F" ...
##  $ dob        : Date, format: "2000-06-05" "1999-11-25" ...
##  $ address    : chr  "U" "U" "U" "U" ...
##  $ famsize    : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus    : chr  "A" "T" "T" "T" ...
##  $ Medu       : num  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu       : num  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob       : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob       : chr  "teacher" "other" "other" "services" ...
##  $ reason     : chr  "course" "course" "other" "home" ...
##  $ guardian   : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime : num  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime  : num  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures   : num  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup  : chr  "yes" "no" "yes" "no" ...
##  $ famsup     : chr  "no" "yes" "no" "yes" ...
##  $ paid       : chr  "no" "no" "yes" "yes" ...
##  $ activities : chr  "no" "no" "no" "yes" ...
##  $ nursery    : chr  "yes" "no" "yes" "yes" ...
##  $ higher     : chr  "yes" "yes" "yes" "yes" ...
##  $ internet   : chr  "no" "yes" "yes" "yes" ...
##  $ romantic   : chr  "no" "no" "no" "yes" ...
##  $ famrel     : num  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime   : num  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout      : num  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc       : num  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc       : num  1 1 3 1 2 2 1 1 1 1 ...
##  $ health     : num  3 3 3 5 5 5 3 1 1 5 ...
##  $ nurse_visit: POSIXct, format: "2014-04-10 14:59:54" "2015-03-12 14:59:54" ...
##  $ absences   : num  6 4 10 2 4 10 0 6 0 0 ...
##  $ Grades     : chr  "5/6/6" "5/5/6" "7/8/10" "15/14/15" ...
# Coerce Grades to character
students$Grades <- as.character(students$Grades)

# Coerce Medu to factor
students$Medu <- as.factor(students$Medu)

# Coerce Fedu to factor
students$Fedu <- as.factor(students$Fedu)
    
# Look at students once more with str()
str(students, give.attr = FALSE)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 395 obs. of  33 variables:
##  $ X1         : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ school     : chr  "GP" "GP" "GP" "GP" ...
##  $ sex        : chr  "F" "F" "F" "F" ...
##  $ dob        : Date, format: "2000-06-05" "1999-11-25" ...
##  $ address    : chr  "U" "U" "U" "U" ...
##  $ famsize    : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus    : chr  "A" "T" "T" "T" ...
##  $ Medu       : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 5 4 5 3 5 4 4 ...
##  $ Fedu       : Factor w/ 5 levels "0","1","2","3",..: 5 2 2 3 4 4 3 5 3 5 ...
##  $ Mjob       : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob       : chr  "teacher" "other" "other" "services" ...
##  $ reason     : chr  "course" "course" "other" "home" ...
##  $ guardian   : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime : num  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime  : num  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures   : num  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup  : chr  "yes" "no" "yes" "no" ...
##  $ famsup     : chr  "no" "yes" "no" "yes" ...
##  $ paid       : chr  "no" "no" "yes" "yes" ...
##  $ activities : chr  "no" "no" "no" "yes" ...
##  $ nursery    : chr  "yes" "no" "yes" "yes" ...
##  $ higher     : chr  "yes" "yes" "yes" "yes" ...
##  $ internet   : chr  "no" "yes" "yes" "yes" ...
##  $ romantic   : chr  "no" "no" "no" "yes" ...
##  $ famrel     : num  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime   : num  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout      : num  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc       : num  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc       : num  1 1 3 1 2 2 1 1 1 1 ...
##  $ health     : num  3 3 3 5 5 5 3 1 1 5 ...
##  $ nurse_visit: POSIXct, format: "2014-04-10 14:59:54" "2015-03-12 14:59:54" ...
##  $ absences   : num  6 4 10 2 4 10 0 6 0 0 ...
##  $ Grades     : chr  "5/6/6" "5/5/6" "7/8/10" "15/14/15" ...

Working with dates

Dates can be a challenge to work with in any programming language, but thanks to the lubridate package, working with dates in R isn’t so bad. Since this course is about cleaning data, we only cover the most basic functions from lubridate to help us standardize the format of dates and times in our data.

These functions combine the letters y, m, d, h, m, s, which stand for year, month, day, hour, minute, and second, respectively. The order of the letters in the function should match the order of the date/time you are attempting to read in, although not all combinations are valid. Notice that the functions are “smart” in that they are capable of parsing multiple formats.
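To illustrate how the letter order maps onto the input, here is a short sketch. The date strings are invented examples, and parsing month names such as "August" assumes an English locale.

```r
library(lubridate)

# The same date written three ways parses to the same Date value
d1 <- ymd("2015-08-25")
d2 <- ymd("2015 August 25")
d3 <- mdy("August 25, 2015")

# The letter order must match the input: here the day comes first
d4 <- dmy("25/08/2015")

# Date-times append h, m, s after an underscore
dt <- ymd_hms("2015-08-25 13:33:09")
```

All four dates are equal, which is what makes these functions useful for standardizing a column that mixes formats.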

# Install lubridate (only needed once)
install.packages("lubridate")
# Read students data
library(readr)
students2 <- read_csv("../xDatasets/students_with_dates.csv")
## Warning: Missing column names filled in: 'X1' [1]
# Preview students2 with str()
#str(students2)

# Load the lubridate package
library(lubridate)

# Parse as date
dmy("17 Sep 2015")
## [1] "2015-09-17"
# Parse as date and time (with no seconds!)
mdy_hm("July 15, 2012 12:56")
## [1] "2012-07-15 12:56:00 UTC"
# Coerce dob to a date (with no time)
students2$dob <- ymd(students2$dob)

# Coerce nurse_visit to a date and time
students2$nurse_visit <- ymd_hms(students2$nurse_visit)
    
# Look at students2 once more with str()
str(students2, give.attr = FALSE, vec.len = 8)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 395 obs. of  33 variables:
##  $ X1         : num  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
##  $ school     : chr  "GP" "GP" "GP" "GP" "GP" "GP" "GP" "GP" ...
##  $ sex        : chr  "F" "F" "F" "F" "F" "M" "M" "F" ...
##  $ dob        : Date, format: "2000-06-05" "1999-11-25" ...
##  $ address    : chr  "U" "U" "U" "U" "U" "U" "U" "U" ...
##  $ famsize    : chr  "GT3" "GT3" "LE3" "GT3" "GT3" "LE3" "LE3" "GT3" ...
##  $ Pstatus    : chr  "A" "T" "T" "T" "T" "T" "T" "A" ...
##  $ Medu       : num  4 1 1 4 3 4 2 4 3 3 4 2 4 4 2 4 4 3 3 4 ...
##  $ Fedu       : num  4 1 1 2 3 3 2 4 2 4 4 1 4 3 2 4 4 3 2 3 ...
##  $ Mjob       : chr  "at_home" "at_home" "at_home" "health" "other" "services" "other" "other" ...
##  $ Fjob       : chr  "teacher" "other" "other" "services" "other" "other" "other" "teacher" ...
##  $ reason     : chr  "course" "course" "other" "home" "home" "reputation" "home" "home" ...
##  $ guardian   : chr  "mother" "father" "mother" "mother" "father" "mother" "mother" "mother" ...
##  $ traveltime : num  2 1 1 1 1 1 1 2 1 1 1 3 1 2 1 1 1 3 1 1 ...
##  $ studytime  : num  2 2 2 3 2 2 2 2 2 2 2 3 1 2 3 1 3 2 1 1 ...
##  $ failures   : num  0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 ...
##  $ schoolsup  : chr  "yes" "no" "yes" "no" "no" "no" "no" "yes" ...
##  $ famsup     : chr  "no" "yes" "no" "yes" "yes" "yes" "no" "yes" ...
##  $ paid       : chr  "no" "no" "yes" "yes" "yes" "yes" "no" "no" ...
##  $ activities : chr  "no" "no" "no" "yes" "no" "yes" "no" "no" ...
##  $ nursery    : chr  "yes" "no" "yes" "yes" "yes" "yes" "yes" "yes" ...
##  $ higher     : chr  "yes" "yes" "yes" "yes" "yes" "yes" "yes" "yes" ...
##  $ internet   : chr  "no" "yes" "yes" "yes" "no" "yes" "yes" "no" ...
##  $ romantic   : chr  "no" "no" "no" "yes" "no" "no" "no" "no" ...
##  $ famrel     : num  4 5 4 3 4 5 4 4 4 5 3 5 4 5 4 4 3 5 5 3 ...
##  $ freetime   : num  3 3 3 2 3 4 4 1 2 5 3 2 3 4 5 4 2 3 5 1 ...
##  $ goout      : num  4 3 2 2 2 2 4 4 2 1 3 2 3 3 2 4 3 2 5 3 ...
##  $ Dalc       : num  1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 ...
##  $ Walc       : num  1 1 3 1 2 2 1 1 1 1 2 1 3 2 1 2 2 1 4 3 ...
##  $ health     : num  3 3 3 5 5 5 3 1 1 5 2 4 5 3 3 2 2 4 5 5 ...
##  $ nurse_visit: POSIXct, format: "2014-04-10 14:59:54" "2015-03-12 14:59:54" ...
##  $ absences   : num  6 4 10 2 4 10 0 6 0 0 0 4 2 2 0 4 6 4 16 4 ...
##  $ Grades     : chr  "5/6/6" "5/5/6" "7/8/10" "15/14/15" "6/10/10" "15/15/15" "12/12/11" "6/5/6" ...

String manipulation

# Install stringr (only needed once)
install.packages("stringr")

Trimming and padding strings

One common issue that comes up when cleaning data is the need to remove leading and/or trailing white space. The str_trim() function from stringr makes it easy to do this while leaving intact the part of the string that you actually want.

str_trim(" this is a test ")
## [1] "this is a test"

A similar issue is when you need to pad strings to make them a certain number of characters wide. For example, suppose you have a set of employee ID numbers, some of which begin with one or more zeros. After reading the data in, you find that the leading zeros have been dropped somewhere along the way, probably because the variable was treated as numeric, in which case leading zeros are meaningless.

str_pad("24493", width = 7, side = "left", pad = "0")
## [1] "0024493"

# Load the stringr package
library(stringr)

# Trim all leading and trailing whitespace
str_trim(c("   Filip ", "Nick  ", " Jonathan"))
## [1] "Filip"    "Nick"     "Jonathan"
# Pad these strings with leading zeros
str_pad(c("23485W", "8823453Q", "994Z"), width = 9, side = "left", pad = "0")
## [1] "00023485W" "08823453Q" "00000994Z"

Padding like this comes up often in practice. For example, str_pad() is useful when importing a dataset with US zip codes: if a zip code column is read in as numeric, R drops any leading 0.
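The zip code case can be sketched as follows. The zip_numeric vector is a hypothetical stand-in for a column that was read in as numbers.

```r
library(stringr)

# Hypothetical zip codes that were read in as numbers, losing leading zeros
zip_numeric <- c(2138, 90210, 501)

# Convert to character, then left-pad with zeros to five characters
zip_fixed <- str_pad(as.character(zip_numeric), width = 5, side = "left", pad = "0")
zip_fixed
## [1] "02138" "90210" "00501"
```

Where possible, reading the column in as character in the first place avoids the problem entirely.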

Upper and lower case

In addition to trimming and padding strings, you may need to adjust their case from time to time. Making strings uppercase or lowercase is very straightforward in (base) R thanks to toupper() and tolower(). Each function takes exactly one argument: the character string (or vector/column of strings) to be converted to the desired case.

# state abbreviations 
states <- c("al", "ak", "az", "ar", "ca", "co", "ct", "de", "fl", "ga", "hi", "id", "il", "in", "ia", "ks", "ky", "la", "me", "md", "ma", "mi", "mn", "ms", "mo", "mt", "ne", "nv", "nh", "nj", "nm", "ny", "nc", "nd", "oh", "ok", "or", "pa", "ri", "sc", "sd", "tn", "tx", "ut", "vt", "va", "wa", "wv", "wi", "wy")

# Make states all uppercase and save result to states_upper
states_upper <- toupper(states)

# Make states_upper all lowercase again
tolower(states_upper)
##  [1] "al" "ak" "az" "ar" "ca" "co" "ct" "de" "fl" "ga" "hi" "id" "il" "in"
## [15] "ia" "ks" "ky" "la" "me" "md" "ma" "mi" "mn" "ms" "mo" "mt" "ne" "nv"
## [29] "nh" "nj" "nm" "ny" "nc" "nd" "oh" "ok" "or" "pa" "ri" "sc" "sd" "tn"
## [43] "tx" "ut" "vt" "va" "wa" "wv" "wi" "wy"
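Case conversion is most useful in data cleaning when the same category has been typed with inconsistent capitalization. A minimal sketch, using a made-up responses vector:

```r
# Hypothetical survey responses with inconsistent capitalization
responses <- c("Yes", "yes", "YES", "No", "no")

# Without normalizing, each spelling counts as a separate value
length(unique(responses))
## [1] 5

# After lowercasing, only the two real categories remain
responses_clean <- tolower(responses)
table(responses_clean)
```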

Finding and replacing strings

The stringr package provides two functions that are very useful for finding and/or replacing patterns in strings: str_detect() and str_replace().

Like all functions in stringr, the first argument of each is the string of interest. The second argument of each is the pattern of interest. In the case of str_detect(), this is the pattern we are searching for. In the case of str_replace(), this is the pattern we want to replace. Finally, str_replace() has a third argument, which is the string to replace with.
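One detail the argument list does not reveal is how many matches get replaced. As a small sketch (the fruit vector is only an illustration): str_replace() changes just the first match in each string, while str_replace_all(), also from stringr, changes every match.

```r
library(stringr)

fruit <- c("banana", "kiwi")

# str_replace() changes only the first match in each string
first_only <- str_replace(fruit, "a", "o")       # "bonana" "kiwi"

# str_replace_all() changes every match
every_match <- str_replace_all(fruit, "a", "o")  # "bonono" "kiwi"

# Patterns are regular expressions by default; wrap the pattern in fixed()
# to match a character such as "." literally
literal_dot <- str_replace_all("3.14", fixed("."), ",")  # "3,14"
```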

str_detect(c("banana", "kiwi"), "a")
## [1]  TRUE FALSE

str_replace(c("banana", "kiwi"), "a", "o")
## [1] "bonana" "kiwi"

The students2 data frame from the dates section is reused here, and stringr is already loaded. students3 is a copy of it to work on, so you can always start from scratch if you happen to make a mistake.

# Copy of students2: students3
students3 <- students2

# Load knitr and kableExtra for kable(), kable_styling() and the %>% pipe
library(knitr)
library(kableExtra)

# Look at the head of students3
students3 %>%
  head() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "left", font_size = 11) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#3f7689")
X1 school sex dob address famsize Pstatus Medu Fedu Mjob Fjob reason guardian traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health nurse_visit absences Grades
1 GP F 2000-06-05 U GT3 A 4 4 at_home teacher course mother 2 2 0 yes no no no yes yes no no 4 3 4 1 1 3 2014-04-10 14:59:54 6 5/6/6
2 GP F 1999-11-25 U GT3 T 1 1 at_home other course father 1 2 0 no yes no no no yes yes no 5 3 3 1 1 3 2015-03-12 14:59:54 4 5/5/6
3 GP F 1998-02-02 U LE3 T 1 1 at_home other other mother 1 2 3 yes no yes no yes yes yes no 4 3 2 2 3 3 2015-09-21 14:59:54 10 7/8/10
4 GP F 1997-12-20 U GT3 T 4 2 health services home mother 1 3 0 no yes yes yes yes yes yes yes 3 2 2 1 1 5 2015-09-03 14:59:54 2 15/14/15
5 GP F 1998-10-04 U GT3 T 3 3 other other home father 1 2 0 no yes yes no yes yes no no 4 3 2 1 2 5 2015-04-07 14:59:54 4 6/10/10
6 GP M 1999-06-16 U LE3 T 4 3 services other reputation mother 1 2 0 no yes yes yes yes yes yes no 5 4 2 1 2 5 2013-11-15 14:59:54 10 15/15/15
# Detect all dates of birth (dob) in 1997, print 10 first results
str_detect(students3$dob, "1997")[1:10]
##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
# In the sex column, replace "F" with "Female" ...
students3$sex <- str_replace(students3$sex, "F", "Female") 

# ... and "M" with "Male"
students3$sex <- str_replace(students3$sex, "M", "Male") 

# View the tail of students3
students3 %>%
  tail(8) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "left", font_size = 11) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#3f7689")
X1 school sex dob address famsize Pstatus Medu Fedu Mjob Fjob reason guardian traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health nurse_visit absences Grades
388 MS Female 1999-05-10 R GT3 T 2 3 services other course mother 1 3 1 no no no yes no yes yes no 5 4 2 1 2 5 2014-12-30 14:59:54 0 7/5/0
389 MS Female 1999-11-19 U LE3 T 3 1 teacher services course mother 1 2 0 no yes yes no yes yes yes no 4 3 4 1 1 1 2014-08-11 14:59:54 0 7/9/8
390 MS Female 1999-07-04 U GT3 T 1 1 other other course mother 2 2 1 no no no yes yes yes no no 1 1 1 1 1 5 2013-12-09 14:59:54 0 6/5/0
391 MS Male 1998-06-06 U LE3 A 2 2 services services course other 1 2 2 no yes yes no yes yes no no 5 5 4 4 5 4 2015-08-06 14:59:54 11 9/9/9
392 MS Male 2000-04-04 U LE3 T 3 1 services services course mother 2 1 0 no no no no no yes yes no 2 4 5 3 4 2 2014-09-01 14:59:54 3 14/16/16
393 MS Male 2000-02-07 R GT3 T 1 1 other other course other 1 1 3 no no no no no yes no no 5 5 3 3 3 3 2015-03-15 14:59:54 3 10/8/7
394 MS Male 1999-09-05 R LE3 T 3 2 services other course mother 3 1 0 no no no no no yes yes no 4 4 1 3 4 5 2015-06-12 14:59:54 0 11/12/10
395 MS Male 1999-01-27 U LE3 T 1 1 other at_home course father 1 1 0 no no no no yes yes yes no 3 2 3 3 3 5 2015-05-31 14:59:54 5 8/9/9

Missing and special values

Finding missing values

As you’ve seen, missing values in R should be represented by NA, but unfortunately you will not always be so lucky. Before you can deal with missing values, you have to find them in the data.

If missing values are properly coded as NA, the is.na() function will help you find them. Otherwise, if your dataset is too big to just look at the whole thing, you may need to try searching for some of the usual suspects like "", "#N/A", etc. You can also use the summary() and table() functions to turn up unexpected values in your data.

In this exercise, we’ve created a simple dataset called social_df that has three pieces of information for each of four friends:

Name
Number of friends on a popular social media platform
Current “status” on the platform

# Create small Social data frame
name <- c("Sarah", "Tom", "David", "Alice")
n_friends <- c(244, NA, 145, 43)
status <- c("Going out!", "", "Movie night...", "")
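# stringsAsFactors = TRUE reproduces the factor-style summary() output
# shown below on R >= 4.0, where the default changed to FALSE
social_df <- data.frame(name, n_friends, status, stringsAsFactors = TRUE)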

# Call is.na() on the full social_df to spot all NAs
is.na(social_df)
##       name n_friends status
## [1,] FALSE     FALSE  FALSE
## [2,] FALSE      TRUE  FALSE
## [3,] FALSE     FALSE  FALSE
## [4,] FALSE     FALSE  FALSE
# Use the any() function to ask whether there are any NAs in the data
any(is.na(social_df))
## [1] TRUE
# View a summary() of the dataset
summary(social_df)
##     name     n_friends                status 
##  Alice:1   Min.   : 43.0                 :2  
##  David:1   1st Qu.: 94.0   Going out!    :1  
##  Sarah:1   Median :145.0   Movie night...:1  
##  Tom  :1   Mean   :144.0                     
##            3rd Qu.:194.5                     
##            Max.   :244.0                     
##            NA's   :1
# Call table() on the status column
table(social_df$status)
## 
##                    Going out! Movie night... 
##              2              1              1

Scan your dataset for NA values, and for values that stand in for them, before deciding how to remedy missing data problems.
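On a larger dataset, a per-column count is usually easier to read than the full is.na() matrix. The sketch below uses a made-up data frame, and the suspects vector is only an assumed list of placeholders; adjust it to whatever your data source uses.

```r
# Hypothetical data with both real NAs and disguised missing values
df <- data.frame(
  id = 1:4,
  score = c(90, NA, 75, NA),
  comment = c("ok", "", "#N/A", "fine"),
  stringsAsFactors = FALSE
)

# Count the NAs in each column
na_counts <- colSums(is.na(df))

# In the character columns, count values that match common
# placeholders for "missing"
suspects <- c("", "#N/A", "N/A", "NA")
is_chr <- sapply(df, is.character)
suspect_counts <- sapply(df[is_chr], function(col) sum(col %in% suspects))
```

Here na_counts reports two NAs in score, and suspect_counts reports two disguised missing values in comment that is.na() alone would not find.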

Dealing with missing values

Missing values can be a rather complex subject, but here we’ll only look at the simple case where you want to normalize and/or remove all missing values from your data. For more information on why this is not always the best strategy, search online for “missing not at random.”

Looking at the social_df dataset again, we asked around a bit and figured out what’s causing the missing values that you saw in the last exercise. Tom doesn’t have a social media account on this particular platform, which explains why his number of friends and current status are missing (although coded in two different ways). Alice is on the platform, but is a passive user and never sets her status, hence the reason it’s missing for her.

# Replace all empty strings in status with NA
social_df$status[social_df$status == ""] <- NA

# Print social_df to the console
social_df
##    name n_friends         status
## 1 Sarah       244     Going out!
## 2   Tom        NA           <NA>
## 3 David       145 Movie night...
## 4 Alice        43           <NA>
# Use complete.cases() to see which rows have no missing values
complete.cases(social_df)
## [1]  TRUE FALSE  TRUE FALSE
# Use na.omit() to remove all rows with any missing values
na.omit(social_df)
##    name n_friends         status
## 1 Sarah       244     Going out!
## 3 David       145 Movie night...

In data analyses, you’ll often want to get a feel for how many complete observations you have. This can help you decide how to handle observations with missing data points.
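Counting complete observations takes one line once missing values are coded as NA. The sketch below rebuilds the small example under a different name (social) so that it is self-contained.

```r
# The small example again, with missing values already coded as NA
social <- data.frame(
  name = c("Sarah", "Tom", "David", "Alice"),
  n_friends = c(244, NA, 145, 43),
  status = c("Going out!", NA, "Movie night...", NA),
  stringsAsFactors = FALSE
)

# Number and share of rows with no missing values
n_complete <- sum(complete.cases(social))       # 2
share_complete <- mean(complete.cases(social))  # 0.5

# Restrict the check to the columns an analysis needs: Alice's row
# is complete once status is ignored
n_complete_friends <- sum(complete.cases(social[, c("name", "n_friends")]))  # 3
```

Checking only the columns you need can keep rows that na.omit() on the full data frame would throw away.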

Outliers and obvious errors

# Simulate some data with three outliers
set.seed(10)
x <- c(rnorm(30, mean = 15, sd = 5), -5, 28, 35)

# View boxplot
boxplot(x, horizontal = TRUE)

Dealing with outliers and obvious errors

When dealing with strange values in your data, you often must decide whether they are just extreme or actually erroneous. Extreme values show up all over the place, but you, the data analyst, must figure out when they are plausible and when they are not.
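One way to act on that decision is to write down a plausible range and flag everything outside it. This is only a sketch: the age vector is invented, and the limits of 14 and 22 are an assumption for secondary-school students, not something taken from the data.

```r
# Hypothetical ages for a handful of secondary-school students
age <- c(16, 17, 15, 38, 18, 40)

# Assumed plausible range; choose limits from domain knowledge
plausible <- age >= 14 & age <= 22

# Inspect the suspicious values before changing anything
age[!plausible]
## [1] 38 40

# One option: recode implausible values as NA so they are treated as missing
age_clean <- ifelse(plausible, age, NA)
```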

We have loaded a dataset called students3, which is another slight variation of the original students dataset. Two variables appear to have suspicious values: age and absences. Let’s explore these values further.

# Read students data
students3 <- read_csv("../xDatasets/students_with_dates.csv")
## Warning: Missing column names filled in: 'X1' [1]
# Simulate an age variable (absences is already in the data)
students3$age <- sample(15:40, size = nrow(students3), replace = TRUE)

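# Build a table of summary() results for every column of students3.
# Note: for character columns summary() returns Length/Class/Mode, which
# cbind() recycles under the Min./Median/etc. row labels, so only the
# numeric columns in the output are meaningful.
sum_students3 <- as.data.frame(do.call(cbind, lapply(students3, summary)))

# Drop the X1 column and render the table
sum_students3[,-1] %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "left", font_size = 11) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#3f7689")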
school sex dob address famsize Pstatus Medu Fedu Mjob Fjob reason guardian traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health nurse_visit absences Grades age
Min. 395 395 9802 395 395 395 0 0 395 395 395 395 1 1 0 395 395 395 395 395 395 395 395 1 1 1 1 1 1 1382972394 0 395 15
1st Qu. character character 10169 character character character 2 2 character character character character 1 1 0 character character character character character character character character 4 3 2 1 1 3 1396839594 0 character 21
Median character character 10576 character character character 3 2 character character character character 1 2 0 character character character character character character character character 4 3 3 1 2 4 1410793194 4 character 28
Mean 395 395 10529.9291139 395 395 395 2.74936708860759 2.52151898734177 395 395 395 395 1.44810126582278 2.03544303797468 0.334177215189873 395 395 395 395 395 395 395 395 3.94430379746835 3.23544303797468 3.10886075949367 1.48101265822785 2.29113924050633 3.55443037974684 1412919071.46835 5.70886075949367 395 27.6405063291139
3rd Qu. character character 10893 character character character 4 3 character character character character 2 2 0 character character character character character character character character 5 4 4 2 3 5 1428461994 8 character 34
Max. character character 11255 character character character 4 4 character character character character 4 4 3 character character character character character character character character 5 5 5 5 5 5 1444921194 75 character 40
# View a histogram of the age variable
hist(students3$age)

# View a histogram of the absences variable
hist(students3$absences)

# View a histogram of absences with left-closed bins, [a, b), so each bin
# includes its lower edge rather than its upper edge
hist(students3$absences, right = FALSE)

As you can see, a simple histogram displaying the distribution of a variable’s values across all observations can be key to identifying potential outliers as early as possible.
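The effect of the right argument is easiest to see on a tiny example with values that sit exactly on bin edges. The absences vector and breaks below are made up for illustration, and plot = FALSE returns the bin counts instead of drawing.

```r
# Hypothetical absence counts with several zeros and values on bin edges
absences <- c(0, 0, 0, 5, 5, 10)
breaks <- c(0, 5, 10)

# Default (right = TRUE): bins are (a, b], so 5 is counted with the zeros
counts_right <- hist(absences, breaks = breaks, plot = FALSE)$counts
counts_right
## [1] 5 1

# right = FALSE: bins are [a, b), so 5 moves into the second bin
counts_left <- hist(absences, breaks = breaks, right = FALSE, plot = FALSE)$counts
counts_left
## [1] 3 3
```

With right = FALSE the first bar reflects only values below 5, which makes a pile-up of zeros easier to judge.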

Another look at strange values

Another useful way of looking at strange values is with boxplots. Simply put, a boxplot draws a box around the middle 50% of values for a given variable, with a bold line drawn at the median. Values that fall far from the bulk of the data points (i.e., outliers) are denoted by open circles. By default, “far” means more than 1.5 times the interquartile range beyond the edge of the box; if you’re curious about the details, check out the range argument in ?boxplot.

# View a boxplot of age
boxplot(students3$age)

# View a boxplot of absences
boxplot(students3$absences)
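To get the flagged values themselves rather than a picture, base R offers boxplot.stats(). The sketch below uses a small invented vector; the by-hand version uses quantile(), whose quartiles can differ slightly from the hinges that boxplot() uses, so treat it as an approximation of the same rule.

```r
# Small hypothetical sample with one low and one high extreme value
x <- c(10, 12, 13, 14, 15, 15, 16, 17, 18, 20, -5, 35)

# boxplot.stats() returns the values boxplot() would draw as open circles
flagged <- boxplot.stats(x)$out
flagged
## [1] -5 35

# Roughly the same rule by hand: beyond 1.5 * IQR from the quartiles
q <- quantile(x, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
by_hand <- x[x < q[1] - fence | x > q[2] + fence]
```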